Making Websites Behave using Perl - The yjobs-proxy Story

Shlomi Fish on 2007-12-21T13:29:02

Here is another cool use for Perl: pre-processing the HTML/JS/CSS markup of poorly-written sites before it reaches the web browser. In this post, I'll tell the story of how I ended up writing the yjobs-proxy markup-transforming proxy using CPAN's HTTP-Proxy to make www.yjobs.co.il work with Firefox on my Linux system.

It all started when I was job-hunting, and was dismayed to discover that there were far fewer Info-Tech job ads in the newspaper's "Wanted Ads" section than there used to be. The section proudly announced that it now has an Internet counterpart - www.yjobs.co.il. But much to my disappointment, it didn't work in my Linux-based, open-source browsers.

I almost immediately thought of writing a Greasemonkey script to whip the JavaScript code there into a shape where it can work with Firefox. Eventually, I started writing it, and looked for a way to inject new declarations of JavaScript functions into the page, to replace the existing and broken ones. I found a way to do that, but it turned out to have some limitations due to the architecture of Greasemonkey and the way it interacts with the page.

After thinking about it for a moment, I realised I could achieve the same thing by transforming the code that Firefox receives from the site into a more agreeable version. So I thought of a transforming proxy. Someone here on use.perl.org had mentioned HTTP::Proxy in one of his posts, so I went to check it out and see if it could solve my problems.

Meanwhile, I was distracted and delayed a bit by investigating this X Server bug. But then I resumed work on the proxy. HTTP-Proxy turned out to be a great way to implement what I had in mind, but I still ran into a few problems (which weren't HTTP-Proxy's fault).

The first one, as far as I recall, was that the proxy refused to filter the JavaScript code. As it turned out, yjobs sent the "Content-Type:" of the JavaScript code either as "application/x-javascript" or not at all, while my filter matched "text/javascript". I ended up filtering the scripts by the .js extension in the path, and by specifying a MIME filter of "undef".
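The matching rule described above can be sketched roughly as follows. This is only an illustration of the decision logic, written in JavaScript rather than taken from the proxy's Perl source; the helper name looksLikeJavaScript and the sample paths are hypothetical, and the real proxy relies on HTTP-Proxy's own filter matching instead.

```javascript
// Hypothetical helper: decide whether a response should go through the
// JavaScript filters. Not part of HTTP-Proxy or of yjobs-proxy itself.
const JS_MIME_TYPES = new Set([
  "text/javascript",
  "application/x-javascript",
]);

function looksLikeJavaScript(path, contentType) {
  // Ignore any query string or fragment when checking the extension.
  const cleanPath = path.split(/[?#]/)[0];
  if (/\.js$/i.test(cleanPath)) {
    return true; // trust the extension even when Content-Type is missing
  }
  if (contentType === undefined || contentType === null) {
    return false; // no header and no .js extension: leave it alone
  }
  // Drop parameters such as "; charset=utf-8" before comparing.
  const mime = contentType.split(";")[0].trim().toLowerCase();
  return JS_MIME_TYPES.has(mime);
}

// Sample paths below are made up for illustration.
console.log(looksLikeJavaScript("/scripts/menu.js", undefined));
console.log(looksLikeJavaScript("/search", "application/x-javascript"));
console.log(looksLikeJavaScript("/index.html", "text/html"));
```

Checking the path first means a script still gets filtered when the server sends no "Content-Type:" header at all, which was the case that tripped me up.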

Then I ran into a problem where a variable called "Data" was assigned to, but apparently not used anywhere else. As it turned out, my logging proxy, which I used to dump all the traffic, did not log the particular script that made use of it. Maybe Firefox had cached it. After I found out where the variable was used, I used the Venkman JavaScript debugger to track down the problem I had getting it displayed on the page. It was fixed using a JavaScript transformation specific to that particular script.

Another problem I encountered was that an original function was still being called even though I had overridden it at the bottom of the page. As it turned out, this happened because the function was invoked before the JS interpreter reached the new definition at the end, as in this code:

<html>
<body>
<script>
function mytest()
{
return "FirstFoo";
}

var myvar = mytest();
</script>

<h1 id="put_here">Put Here</h1>

<script>
document.getElementById("put_here").innerHTML = myvar;
</script>

<script>
function mytest()
{
return "Second";
}
</script>
</body>
</html>

This was resolved by transforming the JS code in the original function.
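To make the ordering issue concrete, here is a small sketch that can be run under Node.js. It uses Node's vm module to mimic a page whose script blocks are evaluated one after another in a shared global scope, and then applies a plain string replacement to the first block. The replacement is a hypothetical stand-in for the kind of transformation the proxy performs, not the actual rule it uses.

```javascript
const vm = require("vm");

// The first <script> block from the page above.
const firstScript = `
function mytest()
{
    return "FirstFoo";
}

var myvar = mytest();
`;

// The overriding <script> block appended at the bottom of the page.
const overrideScript = `
function mytest()
{
    return "Second";
}
`;

// Evaluate the scripts in order in one shared global scope, roughly the
// way a browser handles consecutive <script> elements, and return myvar.
function runInOrder(scripts) {
  const context = vm.createContext({});
  for (const source of scripts) {
    vm.runInContext(source, context);
  }
  return context.myvar;
}

// Without any transformation, the late override comes too late.
const untouched = runInOrder([firstScript, overrideScript]);

// Hypothetical transformation: patch the body of the original function.
const patchedFirstScript = firstScript.replace(
  'return "FirstFoo";',
  'return "Second";'
);
const patched = runInOrder([patchedFirstScript, overrideScript]);

console.log(untouched); // prints "FirstFoo"
console.log(patched);   // prints "Second"
```

Function declarations are only hoisted within a single script block, so a redefinition in a later block cannot affect a call that already ran in an earlier one. That is why the fix had to be applied to the original definition.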

Eventually, I got it working well enough. Then I cleaned up the proxy code, and released it for the world's consumption.

My future plan for this proxy is to investigate a way to implement it as a Firefox extension that will transform the markup from within Firefox.

A fellow Perl programmer I talked with on AIM, whom I pointed to the download page, said that "that's nucking futs, man" and then that "oh, it's cool. I just mean, that's pretty crazy. A proxy to make a site work... crazy. and awesome." :-)

So this is one way Perl has given Power to the People. Hack on!